The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages

Authors

  • Emanuela Cresti
  • Fernanda Bacelar do Nascimento
  • Antonio Moreno-Sandoval
  • Jean Véronis
  • Philippe Martin
  • Khalid Choukri
Abstract

The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event's structure (dialogue vs. monologue) and the sociological domain of use (family-private vs. public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the most relevant cues for identifying the relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle: each textual string ending with a terminal break is aligned, through the WinPitch speech software, to its acoustic counterpart, generating the database of all utterances.

Related resources

The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus

C-ORAL-ROM is a multilingual corpus of spontaneous speech of around 1,200,000 words representing the four main Romance languages: French, Italian, Portuguese and Spanish. The resource will be delivered in standard textual format, aligned to the audio source in a multimedia edition. C-ORAL-ROM aims to ensure at the same time a sufficient representation of spontaneous speech variation in each la...

The C-ORAL-BRASIL I: Reference Corpus for Spoken Brazilian Portuguese

C-ORAL-BRASIL I is a Brazilian Portuguese spontaneous speech corpus compiled following the same architecture adopted by the C-ORAL-ROM resource. The main goal is the documentation of the diaphasic and diastratic variations in Brazilian Portuguese. The diatopic variety represented is that of the metropolitan area of Belo Horizonte, capital city of Minas Gerais. Even though it was not a primary g...

Acoustic-phonetic decoding of different types of spontaneous speech in Spanish

This paper presents preliminary acoustic-phonetic decoding results for Spanish on the spontaneous speech corpus C-ORAL-ROM. These results are compared with results on the read speech corpus ALBAYZIN. We also compare the decoding results obtained with the different types of spontaneous speech in C-ORAL-ROM. As the most important conclusions, the experiments show that the type of spontaneous speec...

Data in Your Language: The ECI

In this paper we describe the contents and the method of production of the ACL European Corpus Initiative Multilingual Corpus 1 (ECI/MC1). This is a large multilingual electronic text corpus, containing 97 million words in 27 (mainly European) languages. It is available cheaply on CD-ROM. Most of the texts in the corpus are marked up using a fully-validated SGML document type description based o...

WinPitch Corpus, a Software Tool for Alignment and Analysis of Large Corpora

Description of endangered languages normally starts with the collection of speech data, which are then segmented into various phonological, prosodic, morphological and syntactic units. In this process, the (phonetic) transcription is the most critical part, and user-friendly tools are essential to tackle any sizeable work in a reasonable amount of time. The software program WinPitch Corpus add...

Publication date: 2004